Search Engines笔记 - Information Needs

CMU 11642 的课程笔记。这一篇概括了用户信息需求的分类、查询语句的结构以及查询的前期处理过程(非结构化的查询语句->结构化的查询语句)。

Query type

  • Informational(39%)
    像 iphones 之类,用户想了解一个 topic。
  • Transactional(36%)
    像购物、买机票之类,用户想找个网站进行交易,但是并没有特定的 destination.
  • Navigational(25%)
    像 CMU 网站之类的,用户有一个特定的想要浏览的 location/destination

Query language

一条标准的 query 分为 3 部分。

  • Source of information: fields, XML elements, metadata
  • Query operators: AND, OR, NEAR/n, …
  • Rules: 怎样使用这些 operators (顺序、权重等)

每一条 query 都会被转化成一个结构化的查询语句。

Query operators

1
2
3
4
5
6
7
Boolean operators: AND, OR, AND-NOT
Distance operators: NEAR/n, WINDOW/n, SENTENCE/n, PARAGRAPH/n
Extent(field) restrictions: BODY, TITLE, INLINK, ABSTRACT, AUTHOR,...
Comparison operators: <, >, BEFORE, AFTER, ...
Score operators: WEIGHT, AVERAGE, MAX, MIN, ...
Synonym
Filter-And-Rank(q1,q2): q1 forms a set, use q2 ranks it

Query Processing

查询处理,这里最常用的是 #NEAR 和 #SYNONYM,对于一些词组(phrases),搜索引擎会用 #NEAR 进行规范化,如

1
2
3
die-cast -> #NEAR/1 (die cast)
virginia beach -> #NEAR/1 (virginia beach)
barack obama -> #NEAR/3 (barack obama)

对于一些缩写,或者拼写错误,一般会用 #SYNONYM 进行调整,如

1
2
3
4
5
6
# Abbreviations
virginia -> (virginia,va)
# Spelliing correction:
brittany -> britney
brittany -> #SYNONYM (brittany,britney)

Query Reformulation

Sequential-Dependency Models(SDM) 会将非结构化的查询转化成结构化的查询语句,一个 SDM query 分为三部分:

  • Bag of words matches
    作用是保证能找到东西。eg. #AND(q1,q2…qn)
  • N-gram matches (ordered,phrase-like)
    给匹配的 n-gram 提供了额外的权重。eg. #NEAR/1(q1,q2) #NEAR/1(q2,q3)…#NEAR/1(qn-1,qn)
  • Short window matches (unordered, sentence-like)
    给匹配的窗口提供了额外的权重。eg. #WINDOW/8(q1,q2)…#WINDOW/8(qn-1,qn)

Eg.

1
2
3
4
5
6
7
8
User Query:
sherwood regional library
A sequential dependency model query:
#wand(
0.5 #and( sherwood regional library )
0.25 #and( #near/1( regional library ) #near/1( sherwood regional ) )
0.25 #and( #window/8( regional library ) #window/8( sherwood regional ) ) )

Perl 代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
#!/usr/bin/perl
#
# Perl subroutine that generates Indri dependence model queries.
#
# Written by: Don Metzler (metzler@cs.umass.edu)
# Last update: 06/27/2005
#
# Feel free to distribute, edit, modify, or mangle this code as you see fit. If you make any interesting
# changes please email me a copy.
#
# For more technical details, see:
#
# * Metzler, D. and Croft, W.B., "A Markov Random Field Model for Term Dependencies," ACM SIGIR 2005.
#
# * Metzler, D., Strohman T., Turtle H., and Croft, W.B., "Indri at TREC 2004: Terabyte Track", TREC 2004.
#
# * http://ciir.cs.umass.edu/~metzler/
#
# MODIFICATIONS
# - Updated by Jamie Callan: 02/11/2015
# Modified to support a less cryptic Indri-like query language.
# #combine --> #and, #1 --> #near/1, #weight --> #wand, and #uw --> #window/
#
# NOTES
#
# * this script assumes that the query string has already been parsed and that all characters
# that are not compatible with Indri's query language have been removed.
#
# * it is not advisable to do a 'full dependence' variant on long strings because of the exponential
# number of terms that will result. it is suggested that the 'sequential dependence' variant be
# used for long strings. either that, or split up long strings into smaller cohesive chunks and
# apply the 'full dependence' variant to each of the chunks.
#
# * the unordered features use a window size of 4 * number of terms within the phrase. this has been
# found to work well across a wide range of collections and topics. however, this may need to be
# modified on an individual basis.
#
# example usage
print formulate_query( "sherwood regional library", "sd", 0.02, 0.49, 0.49 ) . "\n\n";
#print formulate_query( "sherwood regional library", "fd", 0.8, 0.1, 0.1 ) . "\n\n";
#
# formulates a query based on query text and feature weights
#
# arguments:
# * query - string containing original query terms separated by spaces
# * type - string. "sd" for sequential dependence or "fd" for full dependence variant. defaults to "fd".
# * wt[0] - weight assigned to term features
# * wt[1] - weight assigned to ordered (#near) features
# * wt[2] - weight assigned to unordered (#window) features
#
sub formulate_query {
my ( $q, $type, @wt ) = @_;
# trim whitespace from beginning and end of query string
$q =~ s/^\s+|\s+$//g;
my $queryT = "#and( ";
my $queryO = "#and(";
my $queryU = "#and(";
# generate term features (f_T)
my @terms = split(/\s+/ , $q);
my $term;
foreach $term ( @terms ) {
$queryT .= "$term ";
}
my $num_terms = @terms;
# skip the rest of the processing if we're just
# interested in term features or if we only have 1 term
if( ( $wt[1] == 0.0 && $wt[2] == 0.0 ) || $num_terms == 1 ) {
return $queryT . ")";
}
# generate the rest of the features
my $start = 1;
if( $type eq "sd" ) { $start = 3; }
for( my $i = $start ; $i < 2 ** $num_terms ; $i++ ) {
my $bin = unpack("B*", pack("N", $i)); # create binary representation of i
my $num_extracted = 0;
my $extracted_terms = "";
# get query terms corresponding to 'on' bits
for( my $j = 0 ; $j < $num_terms ; $j++ ) {
my $bit = substr($bin, $j - $num_terms, 1);
if( $bit eq "1" ) {
$extracted_terms .= "$terms[$j] ";
$num_extracted++;
}
}
if( $num_extracted == 1 ) { next; } # skip these, since we already took care of the term features...
if( $bin =~ /^0+11+[^1]*$/ ) { # words in contiguous phrase, ordered features (f_O)
$queryO .= " #near/1( $extracted_terms) ";
}
$queryU .= " #window/" . 4*$num_extracted . "( $extracted_terms) "; # every subset of terms, unordered features (f_U)
if( $type eq "sd" ) { $i *= 2; $i--; }
}
my $query = "#wand(";
if( $wt[0] != 0.0 && $queryT ne "#and( " ) { $query .= " $wt[0] $queryT)"; }
if( $wt[1] != 0.0 && $queryO ne "#and(" ) { $query .= " $wt[1] $queryO)"; }
if( $wt[2] != 0.0 && $queryU ne "#and(" ) { $query .= " $wt[2] $queryU)"; }
if( $query eq "#wand(" ) { return ""; } # return "" if we couldn't formulate anything
return $query . " )";
}

另外常用的模型还有 query expansion

徐阿衡 wechat
欢迎关注:徐阿衡的微信公众号
客官,打个赏呗~